 ai-generated data


Understanding nature and nurture: Statistical and AI innovations uncover how genes and environment shape human health

Science

What makes us who we are? Is it our DNA, passed down through generations, or the environment that shapes our lives? This question--how nature and nurture combine to influence health and behavior--has long captured my curiosity. As I grew up in a multigenerational household, I was struck by the story of my two uncles, identical twins who were genetically indistinguishable but who lived out very different health journeys. One developed severe cardiovascular disease by his early forties; the other stayed healthy into his sixties. What separated them was not biology--it was environment.


Psittacines of Innovation? Assessing the True Novelty of AI Creations

Mukherjee, Anirban

arXiv.org Artificial Intelligence

We examine whether Artificial Intelligence (AI) systems generate truly novel ideas rather than merely regurgitating patterns learned during training. Utilizing a novel experimental design, we task an AI with generating project titles for hypothetical crowdfunding campaigns. We compare within AI-generated project titles, measuring repetition and complexity. We compare between the AI-generated titles and actual observed field data using an extension of maximum mean discrepancy--a metric derived from the application of kernel mean embeddings of statistical distributions to high-dimensional machine learning (large language) embedding vectors--yielding a structured analysis of AI output novelty. Results suggest that (1) the AI generates unique content even under increasing task complexity, and at the limits of its computational capabilities, (2) the generated content has face validity, being consistent with both inputs to other generative AI and in qualitative comparison to field data, and (3) exhibits divergence from field data, mitigating concerns relating to intellectual property rights. We discuss implications for copyright and trademark law.
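As a rough illustration of the metric named above, here is a minimal sketch of the standard (biased) squared maximum mean discrepancy estimate with a Gaussian kernel. It is not the paper's extension: the kernel choice, the `gamma` value, the function names, and the toy two-dimensional "embeddings" are all assumptions made for this example.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_kernel(xs, ys, gamma=1.0):
    """Average kernel value over all pairs (x, y) drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))

# Toy 2-D "embeddings" (made-up values, not data from the paper).
generated = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
field = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8]]

print(mmd_squared(generated, generated))  # same sample compared with itself
print(mmd_squared(generated, field))      # two clearly separated samples
```

A value near zero means the two samples are hard to tell apart under the chosen kernel, while larger values indicate divergence. In practice the vectors would be high-dimensional language-model embeddings, and the kernel bandwidth would need tuning.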


A Tale of Tails: Model Collapse as a Change of Scaling Laws

Dohmatob, Elvis, Feng, Yunzhen, Yang, Pu, Charton, Francois, Kempe, Julia

arXiv.org Artificial Intelligence

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.


The perpetual motion machine of AI-generated data and the distraction of ChatGPT-as-scientist

Listgarten, Jennifer

arXiv.org Artificial Intelligence

Since ChatGPT works so well, are we on the cusp of solving science with AI? Is not AlphaFold2 suggestive that the potential of LLMs in biology and the sciences more broadly is limitless? Can we use AI itself to bridge the lack of data in the sciences in order to then train an AI? Herein we present a discussion of these topics.


ChatGPT generates fake data set to support scientific hypothesis

Nature

The artificial-intelligence model that powers ChatGPT can create superficially plausible scientific data sets. Researchers have used the technology behind the artificial intelligence (AI) chatbot ChatGPT to create a fake clinical-trial data set to support an unverified scientific claim. In a paper published in JAMA Ophthalmology on 9 November, the authors used GPT-4 -- the latest version of the large language model on which ChatGPT runs -- paired with Advanced Data Analysis (ADA), a model that incorporates the programming language Python and can perform statistical analysis and create data visualizations. The AI-generated data compared the outcomes of two surgical procedures and indicated -- wrongly -- that one treatment is better than the other. "Our aim was to highlight that, in a few minutes, you can create a data set that is not supported by real original data, and it is also opposite or in the other direction compared to the evidence that are available," says study co-author Giuseppe Giannaccare, an eye surgeon at the University of Cagliari in Italy. The ability of AI to fabricate convincing data adds to concern among researchers and journal editors about research integrity.


AI Is an Existential Threat to Itself

The Atlantic - Technology

In the beginning, the chatbots and their ilk fed on the human-made internet. Various generative-AI models of the sort that power ChatGPT got their start by devouring data from sites including Wikipedia, Getty, and Scribd. They consumed text, images, and other content, learning through algorithmic digestion their flavors and texture, which ingredients go well together and which do not, in order to concoct their own art and writing. Generative AI is utterly reliant on the sustenance it gets from the web: Computers mime intelligence by processing almost unfathomable amounts of data and deriving patterns from them. ChatGPT can write a passable high-school essay because it has read libraries' worth of digitized books and articles, while DALL-E 2 can produce Picasso-esque images because it has analyzed something like the entire trajectory of art history.


The top 5 open-source tools for visualizing AI-generated data

#artificialintelligence

The ability to build artificial intelligence (AI) or machine-learning (ML) models is moving quickly away from the data scientist's domain and toward the citizen developer. Creating results from AI is getting easier, thanks to open-source tools that can convert AI/ML data streams into clear information that drives visualizations. It's essential to visualize AI and ML data in a way that helps you draw insights and find trends and patterns. The quality and quantity of the data available to you are critical factors. A visual representation should have some basic features.